Abstract

The Reliable Multicast Protocol (RMP) provides a unique, group-based model for distributed 
programs that need to handle reconfiguration events at the application layer. This model, 
called membership views, provides an abstraction in which events such as site failures, network
partitions, and normal join-leave events are viewed as group reformations. RMP provides access
to this model through an application programming interface (API) that notifies an application
when a group is reformed as the result of some event. RMP provides applications with reliable
delivery of messages to other group members over an underlying IP Multicast [12, 5] medium in
a distributed environment, even in the case of reformations. A distributed application can use
various Quality of Service (QoS) levels provided by RMP to tolerate group reformations. This 
paper explores the implementation details of the mechanisms in RMP that provide distributed 
applications with membership view information and fault recovery capabilities. 


1 Introduction 

Many distributed programs need to be reconfigured while continuing to provide ser- 
vices, but how a system is reconfigured is often specific to a particular application. 
Therefore, any application programming interface (API) to a distributed environment 
should provide an abstract model of reconfiguration because applications will differ in 
how they handle changes in a distributed environment. For example, teleconferencing 
applications should be highly resilient to site failures or network partitions because such 
failures can be modeled as normal join-leave changes to the group of conference participants.
However, distributed database systems that require atomic transactions will be
highly sensitive to such failures. In either case, the application must decide what levels
of fault tolerance it needs and how to handle changes to other sites and the network in
order to continue service.

*This work is supported by NASA Grant NAG 5-2129 and NASA Cooperative Research Agreement NCCW-0040.
More information pertaining to RMP can be found at http://research.ivv.nasa.gov/projects/RMP/RMP.html

The Reliable Multicast Protocol (RMP) [10, 2, 3] is a unique, high-performance
protocol developed at West Virginia University in cooperation with NASA that will 
soon be presented for consideration as an Internet standard and is being used currently 
in many network software applications. RMP presents an API that provides applica- 
tions with a simplified model of dealing with complex changes in distributed, group 
communication environments. RMP provides a programming abstraction, called mem- 
bership views, for handling reliability, resiliency, fault recovery, and ordering issues in a 
distributed application. 

RMP is based on an algorithm originally developed for reliable delivery of data in 
broadcast-capable, packet-switching networks [4]. The original algorithm allows sites 
in a packet-switching network to establish a token ring for distributing responsibility 
for acknowledgments. A single token is passed from site to site around the ring and 
only the holder of the token (called the current token site) needs to acknowledge certain 
data packets. RMP has high-performance characteristics because acknowledgments 
themselves are multicast to all other token ring sites. This approach orders the data 
packets consistently across all sites and provides a means of passing the token to a new 
token site. 

When a site gets the token (i.e., it becomes the current token site), it multicasts an 
acknowledgment if and only if it has seen all data packets since the last acknowledg- 
ment it received. The token is passed in the multicast acknowledgment packet. The 
acknowledgment packet includes the source and sequence numbers of data packets it is 
acknowledging. This allows each site to detect if any packets are missing. A site will use 
negative acknowledgments to request retransmission of any missing packets. When all
packets since the last acknowledgment received have been received by the current token
site, then that site can multicast its acknowledgment and thus pass the token to the 
next site on the ring. When a token site sends an acknowledgment, it is guaranteed that 
all data packets since it last held the token have been received by all sites. The sender 
of a packet assumes that all messages since it last had the token have been received by 
the other sites within a requested quality of service (QoS) level. A packet is marked 
delivered if and only if it satisfies its QoS level of delivery. The QoS level allows for 
resilience of the protocol in the presence of site failures and network partitions. In the 
case of failures, the token ring reforms itself around the failed site. In the presence of 
persistent failures, the application program using RMP must decide to degrade the QoS 
level or try again. 

RMP differs from previous reliable broadcast protocols in that an acknowledgment
packet may acknowledge an arbitrary number of data packets. Previous protocols specified
that data and acknowledgment packets have a one-to-one relationship. Our
approach, however, improves throughput in networks with sporadic losses. Each site in
a token ring maintains a data structure called an Ordering Queue (OrderingQ) in which
acknowledgments and data packets are organized based on timestamps. An Ordering
Queue is consistent if and only if there are no missing data packets for pending acknowledgments.
A missing packet will appear as an empty slot in the OrderingQ that must
be filled. When a site becomes the token site, all empty slots in the OrderingQ since the
last acknowledgment received must be filled. For example, in Figure 1 we show 3 sites
of a token ring and a global sequence of events. No site has complete knowledge of this
sequence. It is only shown to illustrate a possible scenario. Next to each site is a list of
the messages sent by that site. First, site A sends a data packet signified as Data(A,1),
where the first parameter is the sending site and the second is the sequence number of
the message. Sequence numbers are unique to individual sites. Second, site B sends a
data packet (Data(B,1)). The initial token site is site B, which then acknowledges both
data packets and passes the token to site A. The Ack((A,1),(B,1),A,1) message contains
a list of source identifiers and sequence numbers for two packets, followed by the next
token site and the timestamp of the acknowledgment.

Figure 1: An example of RMP operation

In this example, we assume that site C missed the data packet Data(B,1). Site C
realizes it has missed a packet after it receives the Ack((A,1),(B,1),A,1) message. It
knows this because the Data(B,1) packet is listed in the Ack message from B. Each
slot in an OrderingQ corresponds to a timestamp, whether explicit in the case of Ack
messages or implicit in the case of Data packets. Site C will multicast a Nack message to
request the data packet to fill the one slot in its OrderingQ at timestamp 3. Any other
site in the ring should respond to this Nack with the requested missing packet. In this
example, Site B responds to the Nack by retransmitting the Data(B,1) message. The
sequence number identifies this message uniquely to distinguish it from new messages.
If a period passes during which no data packets are transmitted, a site will time out and
subsequently multicast a NULL Ack packet. In our example, Site A sends a NULL
Ack with timestamp 4 after waiting. This NULL Ack passes the token to site C. After
site B retransmits the packet Data(B,1), site A multicasts another data packet with
sequence number 2 as Data(A,2). Since site C’s OrderingQ is consistent, it multicasts
an acknowledgment of the Data(A,2) packet and passes the token to site B. The global
ordering of events is an artifact of the timestamps and may or may not reflect the actual
order of events. This decentralized notion of ordering, called global synchrony, allows
applications to synchronize their activities based on group events instead of a single,
centralized authority.
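
The gap detection performed by site C in this example can be sketched in a few lines of Python. This is only an illustration, not RMP's implementation: the class name OrderingQ follows the text, but the method names are hypothetical, and the sketch assumes that the data packets listed in an Ack take the implicit timestamps that immediately follow the Ack's own timestamp (which is consistent with the missing slot at timestamp 3 in the example).

```python
from dataclasses import dataclass, field


@dataclass
class OrderingQ:
    """Hypothetical per-site ordering queue (illustrative sketch only)."""
    # timestamp -> (source, seq) of the data packet expected in that slot
    expected: dict = field(default_factory=dict)
    # (source, seq) pairs of data packets this site has actually received
    received: set = field(default_factory=set)

    def on_data(self, source, seq):
        # A data packet (original or retransmitted) fills its slot, if any.
        self.received.add((source, seq))

    def on_ack(self, ack_timestamp, acked_packets):
        # Assumption: the i-th packet listed in the Ack gets the implicit
        # timestamp ack_timestamp + i. A NULL Ack lists no packets.
        for offset, packet in enumerate(acked_packets, start=1):
            self.expected[ack_timestamp + offset] = packet

    def missing_slots(self):
        # Empty slots: timestamps whose data packet has not been received.
        return sorted(ts for ts, pkt in self.expected.items()
                      if pkt not in self.received)

    def is_consistent(self):
        # Consistent iff no data packet is missing for a pending Ack.
        return not self.missing_slots()


# Site C's view of the example: it sees Data(A,1) but misses Data(B,1).
site_c = OrderingQ()
site_c.on_data("A", 1)
site_c.on_ack(1, [("A", 1), ("B", 1)])   # Ack((A,1),(B,1),A,1)
nack_for = site_c.missing_slots()         # slots site C must Nack
site_c.on_data("B", 1)                    # retransmission from site B
```

After the retransmission arrives, the queue is consistent again, so site C would be allowed to acknowledge when it receives the token.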

2 RMP Fault Model 

RMP is a modification of a Post-Ordering Rotating Token algorithm originally developed
by Chang and Maxemchuk [4]. A pass of the token around the ring provides
ordering notification to all group members. The token itself acts as a combination 
of positive and negative acknowledgments to group members for message ordering and 
reliable delivery without the overhead of large numbers of unicast acknowledgment mes- 
sages. A message is delivered, or stable, if the token is rotated to each of the group 
members in turn. Once the token has made a complete circuit, it is guaranteed that 
packets acknowledged previous to the start of the circuit have been received by all group 
members at that moment. 

The actual modifications from the original Chang and Maxemchuk algorithm are
quite extensive [11, 10]. Two of the most significant areas of redefinition and extension
are in the categories of fault recovery and group membership. Originally, the algorithm
only dealt with steady-state operation and a very restrictive fault recovery process;
that is, no attention was paid to changing the number of members during operation or to
relaxing the fault recovery process to allow applications with less stringent requirements
to continue operation. RMP expanded this by adding the ability to change a group’s
membership dynamically so that members can join and leave a group, integrating this
ability into the protocol operation smoothly, and using the concept of membership views
to adjust the fault recovery process on an individual group basis.

2.1 Membership Views 

Applications using RMP will receive asynchronous events from the RMP layer that indicate
delivery of messages, some exceptional conditions being met, or a change to the
group in some way. A membership view is a snapshot of a group’s current membership
information that is passed up to the application. This snapshot is part of the globally
ordered sequence of events that all group members perceive. All group members receive
the same sequence of events, both messages and membership views, regardless of the
underlying event sequence imposed by an unordered and unreliable network. The membership
view concept allows RMP to provide a virtually synchronous execution model
to applications using it. Virtual synchrony was defined by Ken Birman from his work
on the ISIS system [13, 1]. “Intuitively, this means that the user can program as if the
system scheduled one distributed event at a time” [13]. This approach greatly simplifies
distributed application development and provides a convenient service upon which configurable
systems can be built. The original Chang and Maxemchuk algorithm fails to
provide virtual synchrony, due mainly to its lack of membership changes. The
algorithm also violates virtual synchrony by allowing members to be added during the
fault recovery process.

A change in the membership view is an event that returns the new membership view 
and notifies the application as to the type of event that took place. Some of the more 
interesting and useful membership view change types are: 

• A member has been added to the group (or formed its own group)

• A member has been removed from the group

• A member received a lock¹

• A member was denied a lock

• A member released a lock

• A member changed the Minimum Size Requirement (MSR) (see Section 3.3)

• Some other member change occurred (add, remove, lock change, etc.)

• A fault was detected and recovery is complete

• Group scoping was changed (i.e., a change in the IP Multicast Time-To-Live (TTL)
field)²

¹ RMP provides 256 mutually exclusive locks for members to use.

When a membership change occurs, an application is notified that a group change 
has occurred, what kind of operation occurred (join or leave), and the status of the 
group after the change. The change can be categorized into three classes. First, a 
change may be a local change that affects only the notified member. Local changes are 
the results of requests such as asking to join a group, asking to be removed from a group, 
or requesting a change to a lock. Remote changes are changes that affect other sites. 
These include local changes to other group members, but to an individual application 
the changes appear to be remote. Finally, global changes affect more than one member 
of a group. These are changes such as change of group scoping, notification of fault 
recovery completion and the result of the recovery process. 
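
As a rough illustration of how an application might model these events, the sketch below represents a membership view change and sorts it into the three classes just described. All names here (ChangeType, ChangeClass, MembershipView, classify) are hypothetical and are not taken from the RMP API; the rule used for classification is a simplifying assumption based on the description above.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple


class ChangeType(Enum):
    # One entry per change type listed in the text.
    MEMBER_ADDED = auto()
    MEMBER_REMOVED = auto()
    LOCK_GRANTED = auto()
    LOCK_DENIED = auto()
    LOCK_RELEASED = auto()
    MSR_CHANGED = auto()
    OTHER_MEMBER_CHANGE = auto()
    FAULT_RECOVERED = auto()
    SCOPE_CHANGED = auto()


class ChangeClass(Enum):
    LOCAL = auto()
    REMOTE = auto()
    GLOBAL = auto()


# Assumption: only these change types affect more than one member at once.
GLOBAL_TYPES = {ChangeType.FAULT_RECOVERED, ChangeType.SCOPE_CHANGED}


@dataclass(frozen=True)
class MembershipView:
    members: Tuple[str, ...]   # group membership after the change
    change: ChangeType         # what kind of change produced this view
    subject: Optional[str]     # member the change concerns, None if group-wide


def classify(view: MembershipView, local_member: str) -> ChangeClass:
    """Classify a change from the point of view of one member."""
    if view.change in GLOBAL_TYPES:
        return ChangeClass.GLOBAL
    if view.subject == local_member:
        return ChangeClass.LOCAL
    return ChangeClass.REMOTE
```

The same join event is therefore local to the member that asked to join and remote to every other member, which matches the observation that local changes at other sites appear remote to an individual application.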

RMP delivers a membership change event to the application upon the completion
of the fault recovery process. This process, called reformation, may be successful or
unsuccessful depending on the extent of site failures, partitions, or leave events. At the end
of the reformation process, the result of the reformation is delivered as an event to the
remaining group members. Thus an application can examine the membership view and
the result of the fault recovery process in order to decide what actions it must take to
remain operational. In addition to notification of a fault, RMP allows the application
to specify message resiliency on a per-message basis and allows each member to
“vote” on the minimum size a group must retain in order to proceed after a failure.

² RMP uses the IP Multicast scoping mechanism of Time-To-Live (TTL) for controlling the propagation of RMP
multicast traffic to group sites on a Wide-Area Network such as the Internet.

2.2 RMP Failure Assumptions


Key to any fault detection and recovery system is defining the circumstances
and assumptions under which the system is expected to operate. The result of any fault recovery
operation can be: (1) a success, (2) an atomicity violation, or (3) a failure. An atomicity
violation occurs when the fault recovery process cannot attain a common sequence of
events between members of a group. In practice this situation is very rarely encountered,
but it is possible. Atomicity violations can occur because causally related events may
become misordered due to buffering or internetworking constraints. RMP makes three
assumptions pertaining to failures. These are:

• A site failure means the site stops processing. The site does not inject corrupting
information into a group.

• A message failure can be the result of an overly full buffer at either the receiver or
the sender, or it may be the result of a transmission failure (< 1% of packets on
current local and wide-area networks).

• A failure is detected by a group when communication between the group and a site
fails after R attempts. R must be chosen large enough that a failure is mistakenly detected
infrequently, but small enough to provide timely notification of actual failures.

Additionally, RMP addresses the first assumption by supporting cryptographic au- 
thentication. This does not completely remove the assumption, but it provides a mech- 
anism whereby corruptive sites can be filtered if they can be detected. This method also 
provides protection from unknown sites that may try to corrupt RMP operation. How- 
ever, this approach is only as secure as the means by which the members can retrieve 
the authentication keys and the trustworthiness of the other mechanisms involved. 

2.3 Fault Detection 

As mentioned above under the failure assumptions, RMP performs failure detection
using a series of retransmissions of messages. If a certain number of retransmissions are
attempted without a reply being seen, then the fault recovery process is initiated.

A duality between flow control and fault detection exists that is important to mention.
RMP’s flow control mechanism uses a slight modification of an adaptive flow
control scheme [6]. This scheme dictates retransmission rates and timeouts between
retransmissions to avoid saturating the network during congestion. The adjustment in
retransmission periods has a direct bearing on fault detection in RMP. This aspect of
RMP operation is continually undergoing experimentation and analysis; however, preliminary
experiments have shown that setting R to 10 (for 10 retransmission attempts)
and capping the maximum retransmission period at 2 seconds provides timely
notification on a LAN³. In WAN environments, the R value must be set higher (to
30 or more). Changes to the maximum retransmission period for WAN groups
have shown that 2 seconds⁴ works best as long as the maximum packet sizes are also
kept small in order to reduce, or eliminate, fragmentation. Higher amounts of fragmentation
increase the likelihood of a packet being dropped due to a fragment being lost.
Currently, RMP can adjust the R value based solely on the group scoping value (i.e.,
the IP multicast packet TTL value). Thus it is easy to determine what R value to
use based on whether the RMP group stretches over multiple LANs or is based on a
single LAN. Other adaptive schemes could also be used to dynamically configure the
R value based on previous attempts and other flow control variables. However, this
approach must be carefully examined so that the R value does not grow so large that
fault detection times become excessive⁵.
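
The relationship between R, the retransmission period cap, and the worst-case detection time can be made concrete with a small sketch. The function names and the TTL threshold are assumptions made for illustration: the text only says that R is chosen from the group scoping value, so the rule "TTL of 1 means a single LAN" is a plausible stand-in rather than RMP's actual rule.

```python
# Assumption: a TTL of 1 keeps multicast traffic on the local network,
# so any larger TTL is treated here as a multi-LAN (WAN) group.
SINGLE_LAN_TTL = 1


def choose_r(ttl, lan_r=10, wan_r=30):
    """Pick the retransmission limit R from the group's TTL scope."""
    return lan_r if ttl <= SINGLE_LAN_TTL else wan_r


def max_detection_time(r_attempts, max_period_s=2.0):
    """Worst case: every attempt waits the full capped retransmission period."""
    return r_attempts * max_period_s


lan_worst_case = max_detection_time(choose_r(ttl=1))    # 10 attempts * 2 s
wan_worst_case = max_detection_time(choose_r(ttl=32))   # 30 attempts * 2 s
```

With the values reported above, the worst-case detection time is 20 seconds on a LAN and 60 seconds for a WAN group, which shows why letting R grow without bound would make fault detection impractically slow.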

2.4 Selection of Resiliency and Fault Tolerance Levels 

An RMP application may choose message ordering and resiliency semantics on a per-message
basis. These semantics are defined as RMP Quality of Service (QoS) values that
range from unreliable and unordered to totally ordered and totally resilient. RMP QoS
is organized into a hierarchy that begins with varying levels of ordering and progresses
into resiliency. Message resiliency is based on assurances that RMP places on how
many group members have received a given message using properties of the protocol
operation. In order to meet any resiliency guarantees, the message must also meet total
ordering guarantees. Thus all resilient messages are, by definition, totally ordered. K
resiliency assures that K members of a group receive the message. The value of K may
range from 1 to the size of a group, N. Majority resiliency is the special case where
K = ⌊N/2⌋ + 1, and total resiliency corresponds to the case where K = N.

³ 10 attempts at 2 seconds per attempt implies a maximum detection time of 20 seconds.

⁴ 30 attempts at 2 seconds per attempt implies a maximum detection time of 60 seconds.

⁵ As some would say happens in TCP.

In RMP’s execution model, message delivery and message resiliency are separate but
causally related events. Message delivery is based on ordering alone, while resiliency
is based on the number of token passes after total ordering is met. This separation 
of delivery semantics from resiliency notification allows applications to design efficient 
transaction and persistent object systems. In addition, RMP allows the application to 
request notification when a message has become stable. This notification is another 
event that the application may use to help facilitate its operational correctness with 
respect to group consistency. 
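
The resiliency levels defined in this section reduce to a single number: how many of the N group members must hold a message. A minimal sketch of that arithmetic follows; the function name and the string labels for the levels are hypothetical and are not RMP's QoS identifiers.

```python
def required_receivers(level, n, k=None):
    """Number of group members that must have received a message.

    level: "k", "majority", or "total" (labels invented for this sketch)
    n:     current group size N
    k:     requested K, only used for K resiliency
    """
    if n < 1:
        raise ValueError("group size must be at least 1")
    if level == "k":
        if k is None or not 1 <= k <= n:
            raise ValueError("K must be between 1 and the group size")
        return k
    if level == "majority":
        return n // 2 + 1      # floor(N/2) + 1
    if level == "total":
        return n               # K = N
    raise ValueError("unknown resiliency level: %r" % (level,))
```

For a group of five members, majority resiliency requires three receivers and total resiliency requires all five.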

When a new member is added to an RMP group, the member, in effect, casts a 
vote for the minimum size it requires to be maintained after a failure. The actual 
minimum size of a group is the maximum of the votes from all members. This Minimum 
Size Requirement (MSR) determines the fault tolerance level used by RMP during a 
reformation. A member may change this vote at any time during normal operation. 
Such a change is a change to the membership view and the application is notified of 
this change. The levels of fault tolerance closely reflect the levels of message resiliency 
discussed above. It is highly desirable for an application to use message resiliency and
a specific fault tolerance level to its advantage to provide assurances it may need. For example,
majority resiliency combined with majority fault tolerance assures that if a fault is
recoverable, then someone in the group has the message if its resiliency was met. The
selectable levels of fault tolerance are:

• K Fault Tolerance - Up to N - K members may fail, where N is the group size.

• Majority Fault Tolerance - Up to ⌊N/2⌋ - 1 members may fail. Defining how this
majority is calculated may be done in one of two ways⁶:

  - Optimistic - ⌊N/2⌋ + 1, where N is defined as the number of members currently
  in the group.

  - Pessimistic - ⌊N/2⌋ + 1, where N is defined as the number of current members
  plus members who have left since the last stability point.

• No Fault Tolerance - No members may fail.

If the desired fault tolerance level is not met after a reformation, then the reformation 
is classified as a failure and the application is notified. At this point, the application 
must then decide how to re-form or re-join the group and continue operation. A common 
scheme for doing this is to use a logging facility to synchronize group members to a 
specific point that they all agree upon, re-form or re-join the group, and then continue
operation.
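
The MSR vote and the two ways of computing a majority can be sketched as follows. The function names are hypothetical, and the sketch assumes that the votes in effect are those recorded before the fault; the text does not say whether the votes of dropped members continue to count.

```python
def group_msr(votes):
    """Effective Minimum Size Requirement: the maximum of all members' votes."""
    return max(votes)


def majority(current_members, left_since_stable=0, pessimistic=False):
    """Majority threshold floor(N/2) + 1.

    Optimistic:  N counts only the members currently in the group.
    Pessimistic: N also counts members that have left since the last
                 stability point.
    """
    n = current_members + (left_since_stable if pessimistic else 0)
    return n // 2 + 1


def reformation_meets_msr(survivors, votes):
    """A reformation fails if fewer sites survive than the MSR requires."""
    return survivors >= group_msr(votes)
```

For example, with votes of 2, 3, and 1 the effective MSR is 3, so a reformation that leaves only two surviving sites would be reported to the application as a failure. With five current members and two recent departures, the optimistic majority is 3 while the pessimistic majority is 4.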


3 Fault Recovery Process 

The original Chang and Maxemchuk algorithm [4] presents a very high-level and restricted
reformation process that is not applicable in many domains. RMP expands
on this by relaxing some requirements, specifying the algorithm using state tables [2, 3],
and accounting for other RMP features, such as Multi-RPC, security/authentication,
and dynamic membership changes. RMP does not allow members to be added through
reformation. This was allowed in the original algorithm; however, it violates virtual
synchrony.

The fault recovery process must terminate and be free of livelock. This property
is absolutely critical for continuous operation, especially when changes occur during the
reformation process itself. At each stage of the reformation process, secondary point
failures must be detected. In the state model this is done by using the normal fault
detection methods on the fault recovery messages and/or providing a timeout so that
the state is eventually changed. During the fault recovery process, the members of the
group attempt to come to a common synchronization point that indicates the position
that the fault will take in the global ordering of events. This point must be after all
events that have already been ordered by all members. This ordering allows faults to
be seen as just another kind of event that occurs and can be taken into account by the
application.

⁶ Currently, RMP uses the optimistic method; however, a formal proof is still underway to determine whether this
is correct.

3.1 Two-Phase Commit 

The RMP fault recovery process is a Two-Phase Commit procedure. The member of 
a group that first detects a fault is called the Reform Site for that reformation. The 
reform site is responsible for coordinating and moderating the reformation process. The 
other members of the group are then classified as Slave Sites. Slave sites are passive and 
reactive participants in the reformation process. The two phases are described below. 

In Phase 1, the Reform Site multicasts notification of failure to the group while
the Slave Sites unicast their responses to the Reform Site. These responses indicate a
synchronization point and a desire to participate in the reformation process. The Reform
Site then determines the membership view for the reformation. This will consist of a
subset of the set of sites in the group before the fault was detected, i.e., if
S is the set of sites before the fault, then S' ⊆ S, where S' is the set of sites after the
fault. The sites not in S' are considered to be dropped. The Reform
Site then determines the synchronization point common to all members of the group.
If this point is not reachable, then an atomicity violation has occurred. If the MSR for
the group is met and an atomicity violation did not occur, then the membership view
is defined to be valid; otherwise, the view is invalid. Thus an atomicity
violation indicates an invalid view regardless of whether the MSR is met.

In Phase 2, the new membership view is installed at the surviving sites. The Reform
Site multicasts the membership view and the Slave Sites unicast their responses to
acknowledge reception of the membership view. The Reform Site receives confirmation
of reception of the membership view from all reformation participants. If the new membership
view is valid, then once all confirmations have been received, all the sites return to normal
operation and notify the application of the fault and of the successful fault recovery. If
the membership view is invalid, then all sites return to the RMP “idle” state and notify
the application of the fault and that fault recovery failed⁷. If the membership view is
valid and confirmation from one or more members does not arrive within a retransmission
cycle of the membership view, then the Reform Site assumes a secondary failure,
aborts the current reformation, and the process begins again.
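
The decisions the Reform Site makes in the two phases can be summarized in a short sketch. This is a simplification under stated assumptions, not RMP's state tables: the names Outcome, phase1, and phase2 are hypothetical, the synchronization point is reduced to a boolean, and the Reform Site is assumed to count itself among both the responders and the confirmations.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    ATOMICITY_VIOLATION = "atomicity violation"
    FAILURE = "failure"
    ABORTED = "aborted"   # secondary failure: the reformation starts over


def phase1(old_members, responders, sync_point_reachable, msr):
    """Build the new view S' (a subset of S) and decide whether it is valid."""
    # Only sites that were already members may appear in the new view.
    new_view = frozenset(old_members) & frozenset(responders)
    if not sync_point_reachable:
        # Invalid regardless of whether the MSR is met.
        return new_view, Outcome.ATOMICITY_VIOLATION
    if len(new_view) < msr:
        return new_view, Outcome.FAILURE
    return new_view, Outcome.SUCCESS


def phase2(new_view, outcome, confirmations):
    """Install the view; a missing confirmation of a valid view aborts."""
    if outcome is Outcome.SUCCESS and not new_view <= set(confirmations):
        return Outcome.ABORTED
    return outcome
```

Note that a site that answers in Phase 1 without having been in the old group is ignored by the intersection, which reflects the rule that members cannot be added through reformation.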

The process is optimized so that when a failure is detected erroneously, RMP does 
not spend vast amounts of time processing needless information. This optimization is 
performed by short-circuiting some steps if all sites are heard from. Even more drastic 
levels of optimization could be performed if the fault cause could be isolated. However, 
this is very difficult to perform generically with RMP. 

3.2 Aborting a Reformation 

In some cases it is necessary to abort the current reformation process and begin again. 
This is performed in cases where multiple reformations are detected, or a secondary 
failure is detected. Multiple reformations can be detected by the Reform or Slave 
sites when fault notification is originated by members other than the Reform Site. 
Secondary failures of the Reform Site can be detected by the slave sites through the use 
of retransmission cycles for the responses they unicast to the Reform Site. Secondary 
failures of slave sites can be detected through retransmission cycles used in installing 
the membership view by the Reform Site. 

⁷ In effect, this will disband the RMP group.

When a Reform or Slave Site sees a condition that suggests that reformation should
be aborted, that site then enters an Abort Reformation state and notifies the group to
abort the current reformation. At any time during reformation, if a site sees this type
of notification, then it must also enter the Abort Reformation state.

Members that enter the abort reformation process set a random timeout so that
deadlock in the process is avoided. The first site whose timeout expires
becomes the Reform Site for the new reformation and starts the reformation process. While
in the abort reformation process, if a site detects a new reformation beginning, it
participates as a slave site⁸.
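
The effect of the random timeout can be illustrated with a small simulation. The function name, the uniform distribution, and the one-second bound are assumptions chosen for the sketch; the text does not specify how RMP draws its timeouts.

```python
import random


def elect_reform_site(sites, rng, max_timeout_s=1.0):
    """Each aborting site draws a random timeout; the earliest expiry wins.

    Returns the winning site and the timeouts drawn by every site.
    """
    timeouts = {site: rng.uniform(0.0, max_timeout_s) for site in sites}
    winner = min(timeouts, key=timeouts.get)
    return winner, timeouts


rng = random.Random(7)   # seeded only to make the example repeatable
winner, timeouts = elect_reform_site(["A", "B", "C"], rng)
# Every other site sees the winner's new reformation before its own
# timeout expires and therefore participates as a slave site.
```

Because the timeouts are drawn independently, two sites rarely expire at the same instant, which is what breaks the symmetry that could otherwise lead to deadlock.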

3.3 Return to Normal Operation 

Returning to normal operation is of high importance to any fault tolerant system. RMP 
allows operation to continue in all cases, but the application must examine the result of 
the reformation to assess the correct behavior. This may mean reforming a group and 
rejoining its members (in the case of atomicity violations and failed reformations), or 
continued processing (in the case of successful reformations). 

Because RMP allows the application to specify a desired MSR, cases can arise where
network partitions cause multiple groups to partition away and continue operation.
Once this occurs, RMP operation does not allow the new sub-groups to rejoin if the
network is repaired. This is achieved through the same mechanisms that RMP uses
to allow multiple groups to exist on a given multicast address. Each packet contains
an identifier that explicitly identifies that packet as belonging to a specific group. This
identifier is called a Token Ring ID, or TRID. A TRID is a triple guaranteed to be
unique in space (IP address and UDP port) and time (12 hour epoch timer). The
TRID is changed on a regular basis, i.e., every 45 minutes, and is also changed for every
attempted reformation. Thus two reformations that occur on a partitioned network
can be filtered based solely on TRIDs. Allowing partitions to come back together
can more easily be done at a higher level than RMP. However, some work with other
reliable broadcast/multicast protocols has produced interesting methods of rejoining
partitions [7, 9]. Other methods of filtering have also been suggested [8]. It is our belief
that applications can benefit from these works to expand RMP’s fault recovery process
to include successful recovery from atomicity violations and the rejoining of partitions.

⁸ Several details here are elided for brevity, including using version numbers for membership views to determine
viability of reformations. The full details are given in [11, 14, 15].
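
The TRID-based filtering described in this section can be sketched as a simple equality check on the triple. The field names, the function names, and the idea that a reformation's TRID is built from the Reform Site's address are assumptions made for illustration; the text only states that the TRID is a triple of IP address, UDP port, and epoch timer value.

```python
from typing import NamedTuple


class TRID(NamedTuple):
    ip: str      # IP address (unique in space)
    port: int    # UDP port   (unique in space)
    epoch: int   # epoch timer value when the TRID was created (unique in time)


def accept_packet(packet_trid, current_trid):
    """A site only processes packets that carry its group's current TRID."""
    return packet_trid == current_trid


def trid_for_reformation(reform_site_ip, reform_site_port, timer_value):
    # Assumption: each attempted reformation gets a fresh TRID derived
    # from the site coordinating it.
    return TRID(reform_site_ip, reform_site_port, timer_value)


# Two halves of a partitioned group reform independently.
left = trid_for_reformation("10.0.0.1", 5000, 1200)
right = trid_for_reformation("10.0.1.9", 5000, 1200)
```

Once the network is repaired, packets from the right-hand sub-group carry a TRID that differs from the left-hand one, so each side silently discards the other's traffic and the two sub-groups do not merge.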

4 Conclusions and Future Work 

RMP provides a mechanism for continuous, reliable, fault-tolerant, and atomic delivery
of messages over a multicast medium, even in the event of site failures, network partitions, and
normal join-leave events. In addition, RMP provides an event-based API that presents
the application with a powerful and intuitive distributed programming model. This
model allows the application to make educated decisions about dynamic group reconfigurations
of the application. RMP’s fault recovery mechanisms allow the application
to tailor itself to any desired level of fault tolerance and message resiliency without
mandating that the application explicitly perform these functions itself.

Several important issues remain to be investigated including the possibility of con- 
tinued operation using the group after an atomicity violation, abstractions for defining 
semantics of a “majority” (pessimistically or optimistically) for an application, and ef- 
fective flow control that is orthogonal (or at least alternatively complementary) to fault 
detection. Isolation of faults in order to optimize the fault recovery process seems to 
hold promise, but RMP’s operation model allows special cases to exist where this issue 
becomes very difficult to tackle effectively.